GRAVITATIONAL ATTRACTOR ENGINE FOR ADAPTIVELY
AUTOCLUSTERING N-DIMENSIONAL DATASTREAMS



This application is a continuation-in-part of U.S. 
Serial No. 751,020, filed August 28, 1991.

Field of the Invention 

This invention relates to a method for classifying 
multi-parameter data in real time (or from recorded data) 
into cluster groups for the purpose of defining different 
populations of particles in a sample. This invention is 
particularly useful in the field of flow cytometry wherein 
multi-parameter data is recorded for each cell that passes 
through an illumination and sensing region. It is
especially useful for classifying and counting
immunofluorescently labeled CD3, CD4 and CD8 lymphocytes in 
blood samples from AIDS patients. 

Background of the Invention 

Particle analysis generally comprises the analysis of 
cells, nuclei, chromosomes and other particles for the 
purpose of identifying the particles as members of different 
populations and/or sorting the particles into different 
populations. This type of analysis includes automated 
analysis by means of image and flow cytometry. In either 
instance, the particle, such as a cell, may be labeled with 
one or more markers and then examined for the presence or 
absence of one or more such markers. In the case of a cell, 
such as a leukocyte, tumor cell or microorganism, the marker 
can be directed to a molecule on the cell surface or to a 
molecule in the cytoplasm. Examination of a cell's physical 
characteristics, as well as the presence or absence of 
marker(s), provides additional information which can be 
useful in identifying the population to which a cell 
belongs.



Cytometry comprises a well known methodology using
multi-parameter data for identifying and distinguishing
between different cell types in a sample. For example, the
sample may be drawn from a variety of biological fluids,
such as blood, lymph or urine, or may be derived from
suspensions of cells from hard tissues such as colon, lung,
breast, kidney or liver. In a flow cytometer, cells are
passed, in suspension, substantially one at a time through
one or more sensing regions where in each region each cell
is illuminated by an energy source. The energy source
generally comprises an illumination means that emits light
of a single wavelength such as that provided by a laser
(e.g., He/Ne or argon) or a mercury arc lamp with
appropriate filters. Light at 488 nm is a generally used
wavelength of emission in a flow cytometer having a single
sensing region.

In series with a sensing region, multiple light
collection means, such as photomultiplier tubes (or "PMT"),
are used to record light that passes through each cell
(generally referred to as forward light scatter), light that
is reflected orthogonal to the direction of the flow of the
cells through the sensing region (generally referred to as
orthogonal or side light scatter) and fluorescent light
emitted from the cell, if it is labeled with fluorescent
marker(s), as the cell passes through the sensing region and
is illuminated by the energy source. Each of forward light
scatter (or FSC), orthogonal light scatter (SSC), and
fluorescence emissions (FL1, FL2, etc.) comprises a separate
parameter for each cell (or each "event"). Thus, for
example, two, three or four parameters can be collected (and
recorded) from a cell labeled with two different
fluorescence markers.

Flow cytometers further comprise data acquisition, 
analysis and recording means, such as a computer, wherein
multiple data channels record data from each PMT for the 
light scatter and fluorescence emitted by each cell as it 
passes through the sensing region. The purpose of the 
analysis system is to classify and count cells wherein each 
cell presents itself as a set of digitized parameter values.
Typically, by current analysis methods, the data collected 
in real time (or recorded for later analysis) is plotted in 
2-D space for ease of visualization. Such plots are 
referred to as "dot plots" and a typical example of a dot 
plot drawn from light scatter data recorded for leukocytes 
is shown in FIG. 1 of U.S. Pat. No. 4,987,086. By plotting
orthogonal light scatter versus forward light scatter, one 
can distinguish between granulocytes, monocytes and 
lymphocytes in a population of leukocytes isolated from 
whole blood. By electronically (or manually) "gating" on 
only lymphocytes using light scatter, for example, and by 
the use of the appropriate monoclonal antibodies labeled 
with fluorochromes of different emission wavelength, one can
further distinguish between cell types within the lymphocyte
population (e.g., between T helper cells and T cytotoxic
cells). U.S. Pat. Nos. 4,727,020, 4,704,891, 4,599,307 and 
4,987,086 describe the arrangement of the various components 
that comprise a flow cytometer, the general principles of 
use and one approach to gating on cells in order to 
discriminate between populations of cells in a blood sample. 

Of particular interest is the analysis of cells from 
patients infected with HIV, the virus which causes AIDS. It 
is well known that CD4+ T lymphocytes play an important role
in HIV infection and AIDS. For example, counting the number
of CD4+ T lymphocytes in a sample of blood from an infected
individual will provide an indication of the progress of the
disease. A cell count under 400 per mm³ is an indication
that the patient has progressed from being seropositive to 
AIDS. In addition to counting CD4+ T lymphocytes, CD8+ T 
lymphocytes also have been counted and a ratio of CD4:CD8 



WO 93/05478 



4 



PCT/US92/07291 



cells has been used in understanding AIDS.

In both cases, a sample of whole blood is obtained from
a patient. Monoclonal antibodies against CD3 (a pan-T
lymphocyte marker), CD4 and CD8 are labeled directly or
indirectly with a fluorescent dye. These dyes have emission
spectra that are distinguishable from each other. (Examples
of such dyes are set forth in example 1 of U.S. Pat. No.
4,745,285.) The labeled cells then are run on the flow
cytometer and data is recorded. The data can be analyzed
in real time or stored in list mode for later
analysis.

While data analyzed in 2-D space can yield discrete
populations of cells, most often the dot plots represent
projections of multiple clusters. As a result, it is often
difficult to distinguish between cells which fall into
regions of apparent overlap between clusters. In such
cases, cells can be inadvertently classified in a wrong
cluster, and thus contribute inaccuracy to the population
counts and percentages being reported. In blood from an HIV
infected patient, for example, over-inclusion of T cells as
being CD4+ could lead a clinician to believe a patient had
not progressed to AIDS, and thus certain treatment which
otherwise might be given could be withheld. In cancers,
such as leukemia, certain residual tumor cells might remain
in the bone marrow after therapy. These residual cells are
present in very low frequencies (i.e., their presence is
rare and thus their occurrence in a large sample is a "rare
event"), and thus their detection and classification are
both difficult and important.

Current data analysis methods fail to provide sufficient 
means to discriminate between clusters of cells, and thus, 
fail to permit more accurate identification and/or sorting 
of cells into different populations. In addition, such 
methods fail to predict if the preparative conditions used 
by the technician were carried out properly (e.g., improper
staining techniques leading to non-specific staining or
pipetting improper amounts of reagent(s) and/or sample(s)).
Finally, most methods work well for mononuclear preparations
from whole blood or on erythrocyte lysed whole blood but
perform poorly on unlysed whole blood because of the
overabundance of red cells and debris in a sample.

Summary of the Invention 

The autoclustering method, described herein as the 
"gravitational attractor engine", addresses the need to 
automatically assign classifications to multi-parameter 
events as they arrive from an array of sensors such as the 
light collection means of a cytometer. It also functions in 
the post-classification of recordings of multi-parameter 
events in list-mode or database format. It is particularly 
useful in clustering 2-parameter data from CD3 and CD4 as
well as CD3 and CD8 T cells labeled with immunofluorescent
markers in blood samples from AIDS patients. 

The gravitational attractor consists of a geometric
boundary surface of fixed size, shape and orientation, but
of variable position, and a computational engine by which the
boundary surface positions itself optimally to enclose a
cluster of multi-parameter events. Multiple attractors may
be employed simultaneously for the purposes of classifying
multiple clusters of events within the same datastream or
recorded data distribution, the strategy being to assign one
attractor per population to be identified and/or sorted.
Classification of events in the datastream consists of a
two-step process: In the first step (pre-analysis), the
datastream is analyzed for purposes of precisely centering
each attractor's membership boundary surface about the
statistical center-of-mass of the data cluster (i.e.,
population) it is intended to classify. Pre-analysis is
terminated after a pre-determined number of events have been
analyzed or if a significant deviation in an attractor's
position is found. In the second step (classification),
each attractor's membership boundary is "locked down in
place", and incoming datastream events are tested against
membership boundaries for classification inclusion vs.
exclusion.

Major benefits of the gravitational attractor engine are 
that it: 1) requires no list-mode recording of events in
the process of their classification (i.e., data may be
analyzed in real time); 2) provides a classification method
tolerant of between-sample drift in the central value of a
data cluster which may arise from any arbitrary combination 
of instrumentation, sample-preparation and intrinsic sample 
variance sources; 3) exhibits stability in the case of 
multiple missing clusters and can count particles in a 
population down to absolute zero in the vicinity of where 
the cluster is expected to locate; and 4) provides 
continuous access to population vector means and membership 
counts during sampling of the datastream, allowing 
continuous process quality assurance (or "PQA") during time-
consuming, rare-event assays. 

Several extensions to the gravitational attractor engine 
increase its benefits: 1) hyperspherical boundary surfaces 
can be elongated on a preferred axis to obtain a cigar- 
shaped attractor; 2) the boundary surface used to gate 
events for gravitational interaction during pre-analysis can 
be different in shape and extent from the membership 
boundary applied during classification; and 3) the subset of 
parameters used to cluster events can be different for 
different attractors, allowing smear-inducing parameters to 
be ignored and permitting data classification at varying 
degrees of dimensional collapse. 

The primary advantage of the gravitational attractor
engine is its capacity for accurate and efficient
autoclustering; that is, it can replace manual-clustering
methods which require human judgment to adapt gating
geometry to normal variances in the positions of target
clusters. By comparison, prior autoclustering methods which
rely on histogram curve analysis to locate threshold-type
separators are less robust in the handling of missing
populations (especially multiple missing clusters).

A cigar-shaped attractor engine performs well at
classifying diagonally-elongated clusters whose "stretch"
originates from partially-correlated (i.e., uncompensated)
events. By comparison, prior methods utilizing 1-D
histogram analysis do not work as well with uncompensated
clusters because their 1-D histogram projections consume
excessive curvespace. Since an attractor can be defined in
arbitrary N-dimensional space, the problem of overlapping
clusters may be redressed through the addition of extra
parameters to tease them apart at no additional
computational complexity. The simplicity and highly
parallel nature of the attractor engine's computations,
together with its stream-oriented data interaction, make
this autoclassification method ideally suited to real time
classification performed on high-event rate, multi-parameter
datastreams. Compared to prior methods which require
remembering a list-mode recording in order to perform data
analysis, the attractor engine's memory requirements are
small and unrelated to the datastream length being sampled,
thus making practicable routine analyses in which several
million events are sampled. The salient benefit of such
mega-assays in cellular diagnostics is to detect diseased
cells at thresholds as low as 1 per million normal cells
(i.e., rare-event assays), thus enabling earlier detection
and milder interventions to arrest disease.

Description of the Drawings 

FIG. 1 illustrates two multi-dimensional attractors (one
spherical and one cigar-shaped) at their seed locations in
multi-space prior to pre-analysis. FIG. 1 depicts two such
projective scatterplots (5 and 6), showing the spherical
attractor's centroid (1), radius (2) and orbital band (7),
and the cigar attractor's centerline (3), radius (4) and
orbital band (8).

FIG. 2 illustrates the same two attractors, by the same
projection scatterplots, at their center-of-mass locations
in multi-space during classification.

FIG. 3 comprises a series of colored 2-D dot plots of
FSC vs. SSC (A), log fluorescence FITC vs. log fluorescence
PE (B), and log fluorescence FITC vs. log fluorescence PerCP
for data collected in list mode from erythrocyte whole blood
to which different fluorescently labeled monoclonal
antibodies have been added. The three gravitational
attractors and their respective seed locations are shown
prior to autoclustering. The blue dots and boundaries
identify the NK cell attractor; the red dots and boundaries
identify the B lymphocyte attractor; and the green dots and
boundaries identify the T lymphocyte attractor.

FIG. 4 comprises the colored 2-D dot plots as set forth 
in FIG. 3 post analysis showing the autoclustered 
populations and final positions of the attractors. The gray 
dots represent unclustered events (e.g., monocytes,
granulocytes and debris) in the sample. 

FIG. 5 comprises two dot plots of log PE versus
PE/Cy5 fluorescence showing three autoclustered populations
from a sample of unlysed whole blood from an AIDS patient to
which a solution containing a known concentration of
fluorescently labeled microbeads and fluorescently labeled
(A) anti-CD3 and anti-CD4 monoclonal antibodies or (B)
anti-CD3 and anti-CD8 monoclonal antibodies has been added.

FIG. 6 comprises a dot plot as in FIG. 5; however, the
blood is taken from a normal individual but the sample has
been rejected by PQA.

Detailed Description 

A gravitational attractor is a small computational 
"engine". Initially, it contains one or more geometric 
parameters set by the user for each type of sample to be 
analyzed or fixed to define an expected target cluster's 
shape, size and approximate location. The attractor engine
further comprises a method for locating a cluster's actual
center-of-mass in the datastream being analyzed, and for
subsequently classifying events in the arriving datastream
which satisfy the attractor's geometric membership
predicate. The term "gravitational" is apt because the
attractor finds its optimal location enclosing the data
cluster by falling to its center-of-mass location under the
accumulative gravitational force of events in proximity to
its expected location in multi-space. The term "attractor",
drawn from dynamical systems theory, refers to the behavior
of a system whereby a multitude of initial state vectors
move toward, and converge upon, a common equilibrium
end-state vector. In this case, the state vector corresponds to
the instantaneous vector location of a roving geometric
boundary surface (specifically a rigidly-attached reference
point within it), as the boundary moves from an initial,
expected "seed" location to equilibrium at a data cluster's
actual center-of-mass location.

The gravitational attractor described below illustrates 
the simplest case of membership geometry, the hypersphere.
The engine of a spherical attractor comprises the following 
fixed and variable components: 

Fixed components:      s    seed, or initial centroid vector of
                            hypersphere, representing approximate
                            expected location of cluster
                       r    radius of hypersphere

Variable components:   c    current centroid vector of hypersphere
                       n    number of gravitationally-interacting
                            events so far within the current
                            datastream

Before a datastream begins, the invariant aspects of the 
target cluster are first specified in terms of seed location, 
s, and radius, r. The specifications of s and r are made by 
observing projections of the cluster in 2-D projection 
scatterplots, whereby two coordinates of s are adjusted at a 
time using a 2-D locator device, and r is edited by
"pulling" on its appearance with a locator device until
satisfactory. 

The events in the datastream encountered consist of a 
variable number of multi-parameter events ei where i indexes 
the number (or sequence) of the event in the stream and e is 
the vector of parameter values comprising that particular 
event. Prior to analyzing the datastream, c is initialized 
to the seed location, s. 
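
The fixed and variable components listed above can be held in a small record. The following is a minimal sketch only; the class name `SphericalAttractor` and the example seed values are hypothetical and do not come from the specification.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SphericalAttractor:
    """Hypothetical container for the components of a spherical attractor."""
    s: List[float]                                 # fixed: seed centroid vector
    r: float                                       # fixed: hypersphere radius
    c: List[float] = field(default_factory=list)   # variable: current centroid
    n: int = 0                                     # variable: interactions so far

    def __post_init__(self) -> None:
        # Prior to analyzing the datastream, c is initialized to the seed s.
        self.c = list(self.s)


# Illustrative seed in a 3-parameter space (values are made up).
attractor = SphericalAttractor(s=[200.0, 150.0, 80.0], r=25.0)
```

Copying s into c keeps the seed available afterwards, which matters for later checks that compare the final centroid with the seed.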

Attractor autoclustering of the datastream comprises a 
two-step process: In the first step, pre-analysis, the 
datastream is analyzed for purposes of precisely centering
the attractor's membership boundary surface about the
statistical center-of-mass of the data cluster it is 
intending to classify. Upon arrival of the first event, and 
that of each subsequent event during pre-analysis, the 
spherical attractor transforms each event vector into its own 
local coordinate system, whose origin is based at c: 



local ei = ei - c        (transformation to local coordinates)

Next, the attractor decides whether local ei is short
enough in length (ei is close enough in proximity to c) to be
allowed attractive pull on c. The interaction gating
predicate, g, evaluates affirmatively if local ei has vector
length less than r:

g(local ei) = length(local ei) < r

If the above proximity test is met, ei is permitted to
exert an increment of attractive pull on c (i.e., to enter
into the center-of-mass calculation). The center-of-mass of
a lone cluster in an otherwise vacuous dataspace can be 
defined simply as the vector-mean of all N event vectors, 

c = Σ ei / N        (center-of-mass for lone cluster)

In multi-cluster distributions, each cluster applies its 
interaction gating function, g, whose job is to protect its 
centroid calculation from the influence of density pockets 
elsewhere in space: 

c = Σ ei * g(local ei) / N
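
The transformation to local coordinates, the gating predicate and the gated center-of-mass can be sketched together as follows. This is an illustration under stated assumptions: the function names are hypothetical, and the divisor is taken to be the number of events that pass the gate, which matches the later description of c as the running sum divided by the interaction count n.

```python
import math


def to_local(e, c):
    """Transform event vector e into coordinates whose origin is c."""
    return [ei - ci for ei, ci in zip(e, c)]


def gate(local_e, r):
    """Interaction gating predicate: true if the local vector is shorter than r."""
    return math.sqrt(sum(x * x for x in local_e)) < r


def gated_center_of_mass(events, c, r):
    """Vector mean of the events that pass the gate centered at c."""
    gated = [e for e in events if gate(to_local(e, c), r)]
    if not gated:
        return list(c)  # assumption: with no gated events, keep the current c
    return [sum(e[d] for e in gated) / len(gated) for d in range(len(c))]


# Two nearby events and one distant event from another density pocket.
events = [[1.0, 1.0], [3.0, 1.0], [100.0, 100.0]]
center = gated_center_of_mass(events, c=[2.0, 2.0], r=5.0)  # [2.0, 1.0]
```

The distant event is excluded by the gate, so it has no influence on the computed center.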

Rather than update c continuously with each interaction
(an inefficient approach prone to instability in the case of
missing clusters), the attractor's centroid, c, is updated on
a fixed schedule at prescribed interaction count milestones
(i.e., s1, s2, s3 ... sm). For this purpose, the attractor
keeps a running vector sum sigma of all its interacting event
vectors. At the start of pre-analysis, sigma and n are
zeroed. During pre-analysis, each arriving event vector
which satisfies the above gating predicate is accumulated, by
vector addition, into sigma, the interaction count n is
incremented,



sigma = sigma + ei        (effect of each event interaction)

n = n + 1

and if n is one of the scheduled update milestones (e.g., s1),
the centroid is updated

c = sigma / n        (effect of centroid update)

whereby the new value of c is the running vector sum sigma, 
scalar divided by n. At the completion of each update, c 
contains the vector mean of all events which have so far 
interacted with the attractor. This new, refined value of c, 
which carries the weight of more data than did its
previous value, governs subsequent interaction gating until
the next update milestone is reached. 
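
The accumulate-then-update behavior described above can be sketched as a small class. The class and method names are hypothetical, and the tiny milestones are chosen only so the example is easy to follow.

```python
import math


class Attractor:
    """Sketch of a spherical attractor that updates c only at milestones."""

    def __init__(self, seed, radius, milestones):
        self.c = list(seed)                # current centroid, starts at the seed
        self.r = radius
        self.milestones = set(milestones)  # interaction counts that trigger updates
        self.sigma = [0.0] * len(seed)     # running vector sum of gated events
        self.n = 0                         # interaction count

    def interact(self, e):
        """Offer one event; return True if it passed the interaction gate."""
        if math.dist(e, self.c) >= self.r:
            return False
        self.sigma = [s + x for s, x in zip(self.sigma, e)]
        self.n += 1
        if self.n in self.milestones:
            self.c = [s / self.n for s in self.sigma]  # c = sigma / n
        return True


a = Attractor(seed=[0.0, 0.0], radius=10.0, milestones=[2, 4])
a.interact([2.0, 0.0])        # n = 1, not a milestone, c unchanged
centroid_before = list(a.c)   # still the seed
a.interact([4.0, 2.0])        # n = 2, milestone reached, c becomes [3.0, 1.0]
```

Between milestones the gate stays anchored at the previous value of c, so a single event cannot drag the gate away.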

The initial seed point, s, serves as a default centroid 
to get the calculation started. It should reflect the best 
available information about expected cluster position. Once 
the gated vector sum has accumulated some actual data (e.g.,
s1 = 50), the computed centroid, c, replaces s as the best-
available central value for anchoring the interaction gate.

local ei = ei - c        (computed c replaces s)

The update schedule for c subserves the goal of only
improving its accuracy over time. The first attractor
update milestone s1 is called the threshold of inertia. It
must be overcome in order to replace the seed value, and it
serves as a check on wandering. If a cluster were depleted
down to a handful of events, and the centroid were allowed to
update on the first gated event, and that event fell just
inside the gate, the updated gate could be dislocated up to a
distance r from the seed point, possibly excluding centrist
events from further consideration. If the threshold of
inertia cannot be surmounted, no positional refinement is
allowed (i.e., the seed value, s, specifies default
emplacement of cluster membership geometry). Consequently,
clusters which have become so depleted that no density
landmark can be established are default-gated about the point
where they were expected to have been found.

If the threshold of inertia has been surmounted, the
attractor is allowed to gravitate toward the local
center-of-mass. Periodic centroid updates (e.g., every 50
interactions) would move the attractor toward a convergence
point, but a more efficient update schedule observes the
statistical rule that residual error diminishes as the
inverse square root of the number of interactions.
Therefore, a parabolic update schedule (e.g., s1 = 100, s2 =
400, s3 = 900, s4 = 1600 ...) provides statistically
significant centroid corrections on every update, whereas
periodic updates take the centroid along a more oscillatory
path toward the same eventual outcome.
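
The example milestones above follow the pattern sk = 100 * k * k. A brief sketch of such a schedule, together with the idealized inverse-square-root error rule stated in the text, is given below. The function names and the unit-free error measure are assumptions made for illustration.

```python
import math


def parabolic_schedule(base, count):
    """Milestones s_k = base * k^2 for k = 1..count."""
    return [base * k * k for k in range(1, count + 1)]


def relative_error(n):
    """Idealized residual error after n interactions (arbitrary units)."""
    return 1.0 / math.sqrt(n)


schedule = parabolic_schedule(base=100, count=4)   # [100, 400, 900, 1600]
errors = [relative_error(n) for n in schedule]

# Under the inverse-square-root rule, the error at the k-th milestone
# is 1/k of the error at the first milestone.
ratios = [errors[0] / err for err in errors]
```

With this spacing, each update waits for enough new data to shrink the expected error by a further fraction of its first-milestone value, rather than updating at equal intervals where later corrections are dominated by noise.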

The cessation of the pre-analysis activity for a single
attractor is triggered by either attainment of the number of
interactions, sm, specified as the final scheduled update
milestone (the attractor's "interaction quota"), or a global
time-out metered in time or total events acquired, whichever
comes first. If multiple attractors are interacting with the
pre-analysis datastream, attractors which have reached their
interaction quotas lie dormant while awaiting the attainment
of quota by all other attractors, or the global time-out,
whichever comes first. If pre-analysis is terminated by
global time-out, each attractor which fell short of its
interaction quota but which surpassed its threshold of
inertia is given a final centroid update, so that event
interactions accumulated since its last previous update are
represented in the final value of the centroid, c. The
specification of a global time-out, as a function of time or
total acquired events, is necessary to guarantee termination
of datastream pre-analysis unless there are a priori
guarantees of sufficient population, for each target cluster
in each datastream sample, to always guarantee termination by
satisfaction of interaction quotas.
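
The termination rules for a single attractor can be sketched as one loop. This is a simplified illustration: the function name `pre_analyze` is hypothetical, the global time-out is metered in total events only, and an exhausted stream is treated the same way as a time-out.

```python
import math


def pre_analyze(stream, seed, r, milestones, max_events):
    """Return (centroid, n) after pre-analysis of one attractor.

    milestones is ascending: the first entry is the threshold of inertia
    and the last entry is the interaction quota.
    """
    c, sigma, n = list(seed), [0.0] * len(seed), 0
    threshold, quota = milestones[0], milestones[-1]
    for seen, e in enumerate(stream, start=1):
        if math.dist(e, c) < r:
            sigma = [s + x for s, x in zip(sigma, e)]
            n += 1
            if n in milestones:
                c = [s / n for s in sigma]
            if n == quota:
                return c, n            # interaction quota attained
        if seen >= max_events:
            break                      # global time-out
    if n >= threshold:
        c = [s / n for s in sigma]     # final update after time-out
    return c, n                        # below threshold: c is still the seed


stream = [[1.0, 1.0], [3.0, 1.0], [2.0, 4.0], [40.0, 40.0], [0.0, 0.0]]
c_final, n_final = pre_analyze(stream, [0.0, 0.0], 5.0, [2, 10], max_events=4)
c_seed, n_seed = pre_analyze(stream, [0.0, 0.0], 5.0, [5, 10], max_events=4)
```

In the first call the threshold of inertia (2) is passed before the time-out, so the three gated events all contribute to the final centroid. In the second call the threshold (5) is never reached and the centroid remains at the seed.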

Attractor-based autoclustering is a two-step process. In
the second step, "classification", each attractor's
hyperspherical membership boundary is locked down in place at
its centroid, c, frozen after the last pre-analysis centroid
update was completed (or at s if no update took place). As
each subsequent incoming event arrives in the continuation of
the same datastream which was pre-analyzed, the incoming
event is tested against each membership boundary for
classification inclusion vs. exclusion, and a membership
count is incremented at each inclusion decision. If multiple
classification and counting of the same event is unnatural or
undesirable, one provides a contention resolving mechanism to
assure that each event is classified and counted by only one
attractor. A straightforward mechanism is to prioritize
competing classifications; another is to award membership
based on closest Euclidean proximity. One distinct advantage
of prioritized classification is that it can easily extend
to attractors with more complex geometries which can overlap
in more complex ways, and for this reason it has been adopted
into practice.
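
Prioritized contention resolution can be sketched by testing the frozen boundaries in priority order and stopping at the first match. The attractor names, centroids and radii below are made-up values for illustration, and `classify` is a hypothetical helper.

```python
import math


def classify(event, attractors):
    """Return the name of the first attractor whose boundary encloses event.

    attractors is a list of (name, centroid, radius) in priority order.
    """
    for name, c, r in attractors:
        if math.dist(event, c) < r:
            return name
    return None  # unclustered event


# Two overlapping spherical boundaries; "T" has priority over "B".
attractors = [("T", [10.0, 10.0], 5.0), ("B", [14.0, 10.0], 5.0)]
counts = {"T": 0, "B": 0}
for e in [[12.0, 10.0], [17.0, 10.0], [50.0, 50.0]]:
    label = classify(e, attractors)
    if label is not None:
        counts[label] += 1
```

The first event lies inside both boundaries and is counted once, for the higher-priority attractor. Because the rule only needs a membership test per attractor, it carries over unchanged to boundaries of other shapes.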

During classification, the membership count so far 
accumulated by each attractor is available for deciding when 
enough target events have been counted to terminate the 
assay. These accretional counts may be used for early 
detection of missing clusters, for example, indicating a 
sample-preparation omission which is cause for aborting the
assay. 

The cessation of classification is triggered by 
attainment of "membership quotas" for all attractors, or a
global time-out expressed in time or total acquired events 
during the classification phase. 
Both during and after the classification phase, each
attractor holds its cluster population count and centroid
(location) vectors, and thus provides additional benefits to
data analyses. Such benefits include quality-assurance
mechanisms by which the user can define acceptable vs. 
aberrant datastream distributions, and automatically have the 
latter flagged. 

A "minimum expected population" (defined a priori for each
cluster as a function of membership count or a derivative
thereof) is compared to the actual membership counts (or a
derivative thereof) during, and after termination of,
classification. An error condition or warning is generated
for each cluster evidencing an unexpectedly low population.
This type of PQA benefits from the unique missing cluster
stability of the gravitational attractor classification
method (i.e., the attractor will accurately count down to
absolute zero the occurrence of events in the vicinity where
the cluster was expected to have presented itself). A check
on attainment of minimum expected population for each target
cluster makes the overall autoclustering system vigilant to
any number of instrumentation, sample preparation, and
intrinsic sample aberrations that express as absent target
population(s).
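
A minimum-expected-population check reduces to comparing each membership count with its a priori minimum. The sketch below assumes plain counts (not a derivative of them); the cluster names and thresholds are invented for the example.

```python
def low_population_warnings(counts, minimum_expected):
    """Names of clusters whose membership count is below the a priori minimum."""
    return sorted(
        name
        for name, minimum in minimum_expected.items()
        if counts.get(name, 0) < minimum
    )


warnings = low_population_warnings(
    counts={"CD4": 3, "CD8": 850},
    minimum_expected={"CD4": 100, "CD8": 100, "beads": 500},
)
```

A cluster with no recorded count at all (here "beads") is treated as a count of zero, which corresponds to a missing cluster.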

As a second benefit, a "tether" may be employed to define
the permissible roving distance of each attractor from its
seed position. A tether length (defined a priori and
expressed as a scalar distance in multi-space) is compared to
the actual displacement of c from its starting seed location,
s, to determine if the tether length has been exceeded. If
exceeded, an error or warning is generated indicating that a
cluster has been found too far from its expected location. A
test on proximity of actual cluster (vector mean) location to
expected cluster location, for each target cluster, makes the
overall autoclustering system vigilant to any number of
instrumentation, sample preparation, and intrinsic sample
aberrations that express as unreasonable displacements in
multi-space cluster location.
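
The tether test is a single distance comparison between the current centroid and the seed. A minimal sketch follows; the function name is hypothetical.

```python
import math


def tether_exceeded(c, s, tether_length):
    """True if centroid c has roved farther than tether_length from seed s."""
    return math.dist(c, s) > tether_length


# The centroid has moved a distance of 5 from the seed in this example.
too_far = tether_exceeded(c=[3.0, 4.0], s=[0.0, 0.0], tether_length=4.0)
within = tether_exceeded(c=[3.0, 4.0], s=[0.0, 0.0], tether_length=5.0)
```

Because only c and s are needed, the test can be repeated after every centroid update without storing any events.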

Though other classification methods can yield a
population vector mean (and can compare proximity to a priori
expected location), the attractor method has the unique
advantage of requiring no list-mode recording. Because the
tether constraint can be checked each time the attractor
moves its position during pre-analysis, it is practical to
detect cluster position aberrance early in exposure to the
datastream; thus a time-consuming mega-assay can be
interrupted early on, rather than waiting until its
completion to find out it must be rejected for PQA reasons.

As a third benefit, a well-formed cluster should consist
of a dense area of events surrounded by a void region. To
assure proper cluster membership and classification where a
cluster is less well formed, an orbital band can be placed
around the cluster membership boundary. The purpose of the
orbital band is to guard against the movement of a cluster
too far from its boundary, an unexpected change in the shape
of a cluster and higher than expected noise. In any or all
of such situations, a high number of events within the
orbital band (or "orbiters") is an indication that the data
may be unacceptable. Generally, less than 3% of the events
for a cluster should fall within the orbital band.
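
An orbital-band check for a spherical attractor can be sketched by counting events in the shell just outside the membership boundary. Two assumptions are made here that the text does not state: the band is modeled as the shell between r and r plus a chosen thickness, and the 3% figure is applied to the ratio of orbiters to members.

```python
import math


def orbiter_fraction(events, c, r, band_thickness):
    """Ratio of events in the orbital band to events inside the boundary."""
    members = orbiters = 0
    for e in events:
        d = math.dist(e, c)
        if d < r:
            members += 1
        elif d < r + band_thickness:
            orbiters += 1
    if members == 0:
        return 0.0 if orbiters == 0 else float("inf")
    return orbiters / members


# 100 members at the centroid, 2 orbiters, 1 distant event.
events = [[0.0, 0.0]] * 100 + [[10.5, 0.0]] * 2 + [[50.0, 0.0]]
fraction = orbiter_fraction(events, c=[0.0, 0.0], r=10.0, band_thickness=1.0)
acceptable = fraction < 0.03
```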

Referring to FIG. 1, the centroid (1), radius (2) and
orbital band (7) are shown for a spherical attractor. The
thickness of the orbital band is arbitrary. A "thin" band
will include fewer orbiters than a "thick" band. FIG. 2
shows the movement of all of the components during
classification.

A limitation of the hyperspherical attractor (i.e., its
not being well-fitted to the elongated shape of many
multi-space data clusters) can be overcome by a modification
to the gating (or boundary surface) geometry. The
characteristic of the attractors, whereby each employs an
interaction gating function, g, whose job is to protect its
centroid calculation from the influence of events in other
clusters, makes it advantageous to deploy gating geometries
that closely approximate actual cluster shape. Better fitting
boundaries allow the targeting of more populations within a
fixed-size dataspace.

An adaptation that elongates the spherical attractor is 
to replace its centroid vector, c, with a straight line 
segment in multi-space running between two endpoint vectors, 
e1 and e2. The line connecting the two endpoints is called
the attractor's "centerline". Instead of measuring the
proximity of an event in terms of its distance from a single 
centerpoint, by extension, proximity is measured in terms of 
distance from the nearest point on the centerline. The locus 
of points equidistant from the centerline gives rise to a 
boundary surface that is a hypercylinder with rounded ends. 
In 3-D space, this solid assumes the shape of a cigar. 

The cigar attractor's radius, cr, specifies both the
cigar's cylindrical radius and the radius of curvature of its
endcaps.

The midpoint, mp, of the centerline is the center of the
cigar, and serves as the origin of the cigar's local
coordinate system.

The geometric components of the cigar attractor that 
differ from those of the spherical attractor are: 

Fixed components:      seed centerline = [e1s, e2s]
                            seed endpoints of initial centerline
                            of cigar, representing approximate
                            expected location and orientation of
                            cluster
                       cr   radius of cigar cylinder and endcaps

Variable components:   centerline = [e1, e2]
                            current endpoints
                       mp   midpoint of current centerline

The cigar attractor's interaction gating function g(ei) 
for event ei is: 

g(ei) = distance(ei, centerline) < cr 

The distance function first finds p, the nearest point on 
the centerline to ei (the projection of ei onto the 
centerline). If p projects beyond the end of the centerline, 
the distance to the closest endpoint is computed; otherwise 
the distance between p and ei is used. 

When the cigar attractor commences an update of its 
location during pre-analysis, the new midpoint mp assumes the 
value of the gated vector mean of all events which have thus 
far interacted. The endpoints of the centerline, maintaining 
rigid values in local coordinates, receive the same delta 
vector as was applied to mp, thus the centerline moves as a 
rigid structure under the pull of combined gravitational 
event force on its midpoint. 
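
A minimal sketch of this rigid update, under the assumption that the gated events are held in a list and that vectors are plain Python lists, is given below. The function name and argument layout are hypothetical and are not taken from the specification.

```python
def update_cigar(e1, e2, gated_events):
    """Translate the centerline rigidly so its midpoint sits at the gated vector mean."""
    n = len(gated_events)
    if n == 0:
        return list(e1), list(e2)               # nothing has interacted yet
    dims = len(e1)
    mean = [sum(ev[k] for ev in gated_events) / n for k in range(dims)]
    mp = [(a + b) / 2.0 for a, b in zip(e1, e2)]
    delta = [m - c for m, c in zip(mean, mp)]   # delta vector applied to mp
    # Both endpoints receive the same delta, so length and orientation are unchanged.
    new_e1 = [a + d for a, d in zip(e1, delta)]
    new_e2 = [b + d for b, d in zip(e2, delta)]
    return new_e1, new_e2
```

Because only a translation is applied, the endpoints keep their values in local coordinates, which is the behavior described in the preceding paragraph.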

The cigar membership gating function applied during 
classification is the same as g(ei) above. 

The proximity function and centerline update are the only 
aspects of the cigar attractor that differ from the spherical 
attractor. All other behaviors are identical. A primary 
benefit of the cigar attractor is its ability to handle 
correlated multi-parameter clusters. If two sensory channels 
are identical in their sensitivity and fed the same signal, 
all their 2-D event vectors will fall on the diagonal 
characterized by the equation (x = y). If two sensory 
channels have partially-overlapping sensitivities and are 
exposed to each other's uncorrelated input signals, the joint 
distribution will retain some diagonal stretch by virtue of 
unintended channel crosstalk (uncompensated data). 

Electronic compensation (the subtracting out of crosstalk 
components) is difficult to specify as the number of sensory 
channels and crosstalk interactions increases. A more 
practical approach, reduced to practice in this invention, is 
to cluster directly on raw, uncompensated event vectors 
employing a cigar attractor oriented along the principal 
stretch vector of the cluster in multi-space. The 
specification of the centerline endpoints is made by 
observing projections of the cluster in 2-D projection 
scatterplots, whereby two coordinates of the endpoint are 
adjusted at a time using a 2-D locator device. The 
specification of cr is edited by "pulling" on its appearance 
with a locator device until satisfactory. 

Referring to FIG. 1, the center (3), radius (4) and 
orbital bands are shown for a cigar attractor. FIG. 2 shows 
the movement of these components during classification. 

A slightly different geometry (other than cigar-shaped) 
suitable for elongated clusters is the hyperellipse. The 
attachment of an elliptical boundary surface to the attractor 
behavior claimed herein will be referred to as the elliptical 
attractor. 

The orientation axis of the ellipse is specified by its 
two foci vectors, f1 and f2. The proximity of an event is 
measured in terms of the sum of its two Euclidean distances 
from the two foci, and the ellipse radius, er, specifies the 
upper limit of this sum for event inclusion. 

The elliptical attractor's interaction gating function, 
g(ei), for event ei is: 

g(ei) = distance(ei, f1) + distance(ei, f2) < er 



The midpoint, mp, of the orientation axis is the center 
of the ellipse, and serves as its local coordinate system 
origin. The specification of the principal axis and the 
gating function are the only two aspects of the elliptical 
attractor that differentiate it from the cigar attractor. As 
in the case of the cigar attractor, positional deltas applied 
to the midpoint propagate to each focus so that the ellipse 
can maintain its fixed orientation, size and shape. 
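
The elliptical gate reduces to a sum of two Euclidean distances compared against the ellipse radius. The following is a sketch only; the function name and the tuple representation of the foci are assumptions introduced here for illustration.

```python
import math

def elliptical_gate(event, f1, f2, er):
    """True when the summed distance from the event to both foci is below er."""
    return math.dist(event, f1) + math.dist(event, f2) < er
```

With foci at (-3, 0) and (3, 0) and a radius of 10, the origin is gated in (the sum is 6), while the point (0, 4) sits exactly on the boundary (the sum is 10) and is excluded by the strict inequality.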

The purpose of an attractor's classification geometry is 
to suitably enclose its target cluster's event cloud when 
deployed at its center-of-mass. The purpose of its 
interaction geometry is to define a "seek area" in which the 
attractor can expect to find its cluster (and little else). 
Since these two geometries serve differing purposes, it is 
sometimes advantageous to customize the geometries subserving 
interactions and classifications. 

A spherical attractor may employ a "membership radius" 
different from its "interaction radius". Other geometries 
can be invoked for defining an attractor's interaction and 
membership boundaries (i.e., squares, rectangles, tilted 
rectangles, ellipses or arbitrary mouse-drawn regions). In 
general, cluster membership boundaries are chosen to 
approximate the actual size and shape of their target 
clusters. Attractor interaction boundaries are chosen that 
both 1) delimit the scan area where center-of-mass should be 
found and 2) exclude neighboring clusters from possible 
interaction. 

An attractor can be defined on a subset of arriving 
parameters. Different attractors may be defined on different 
subsets of arriving parameters, if useful for clustering 
their respective populations. A mask, M, or vector of binary 
switches, is stored within each attractor to signify which 
parameters of incoming event vectors are to be attended to 
and which ignored. Since the attractor engine can be 
defined in any N-dimensional space, it can be defined on a 
subset of parameters without embellishment beyond the mere 
requirement to specify M. The vector operations that 
underlie the attractor engine are implemented in such a way 
that masked out parameters are treated as non-existent in a 
completely transparent fashion. The benefits of parameter 
masking are that it 1) permits data clusters to be defined in 
the subset of parameters which affords the sharpest cluster 
definition, 2) allows parameters to be ignored which smear an 
otherwise well-formed cluster and 3) supports classification 
at varying degrees of dimensional collapse. The latter 
benefit requires that a single event be permitted to be 
classified by multiple attractors. 
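
One way the mask might enter the vector operations is sketched below for a spherical gate. This is an assumption about a possible implementation, not the implementation of the invention; the function names and the encoding of M as a sequence of 0/1 switches are hypothetical.

```python
import math

def masked_distance(event, center, mask):
    """Euclidean distance computed only over the parameters whose mask switch is set."""
    return math.sqrt(sum((x - c) ** 2
                         for x, c, m in zip(event, center, mask) if m))

def masked_spherical_gate(event, center, radius, mask):
    """Spherical gate in the subspace selected by the mask."""
    return masked_distance(event, center, mask) < radius
```

With the mask (1, 0, 1), an event of (1, 100, 5) measured against a center of (1, 0, 1) is 4 units away, since the second parameter is ignored entirely; with the mask (1, 1, 1) the same event is more than 100 units away.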

Referring to FIGS. 3 and 4, peripheral whole blood was 
obtained from normal adult volunteers in EDTA containing 
evacuated blood collection tubes. Erythrocytes were lysed in 
a lysing solution comprising NH4Cl, KHCO3 and EDTA. The 
lysed cells were spun down and removed. 

The remaining cells were placed in a test tube containing 
PBS. To this tube were added, in sequence, Leu 4 FITC (anti- 
CD3; BDIS), Leu 11 + 19 PE (anti-CD16, CD56; BDIS) and Leu 12 
PerCP (anti-CD19; BDIS). These antibodies will label T 
lymphocytes, NK cells and B lymphocytes respectively. After 
incubation the cells were washed and then run on a FACScan 
brand flow cytometer (BDIS) equipped with Consort FACScan 
Research Software (BDIS). The data was acquired and stored 
in list-mode. 15,000 events were recorded. 

In FIG. 3, the seed location, s, and radius, r or cr, of 
each population's attractor was identified prior to analysis 
based upon well known and published data. A spherical 
attractor was applied for B lymphocytes while cigar 
attractors were used for NK cells and T lymphocytes. Each 
attractor then was mouse drawn to represent the expected 
locations of each population when the data was analyzed for 
scatter (A), PE vs. FITC fluorescence (B) and PerCP vs. FITC 
fluorescence (C). Gray dots are shown interposed on the dot 
plots showing unclustered events. (In other embodiments, it 
will be appreciated that these unclustered events need not be 
displayed in either real time or list-mode analysis.) 

In FIG. 4, the results of classification are displayed 
after all recorded events have been analyzed. The parameters 
measured, and thus included in each event vector, were FSC, 
SSC, log PE fluorescence, log FITC fluorescence and log PerCP 
fluorescence. For B lymphocytes, 757 cells (or approximately 
19% of all clustered events) were within this cluster. For T 
lymphocytes, 2596 (or approximately 66% of all clustered 
events) were within the cluster; and for NK cells, 587 events 
were within this cluster. It should be appreciated that the 
data analysis for all of the attractors occurs at the same 
time. FIG. 4 represents the 2-D projection of each attractor 
post analysis. 

Referring to FIGS. 5 and 6, whole blood was obtained from 
an AIDS patient (FIG. 5) and from a normal adult volunteer in 
EDTA containing evacuated blood collection tubes. Each 
sample was split into two aliquots. A mixture of 50,000 
fluorescent microbeads, titered amounts of antibody and 
buffer to make 400 µl was prepared for each aliquot. To one 
aliquot from each sample the antibodies consisted of Leu 4 
PE/Cy5 and Leu 3a PE. (Cy5 was obtained from Biological 
Detection Systems.) To the other aliquot from each sample 
the antibodies consisted of Leu 4 PE/Cy5 and Leu 2a PE. 
(Leu 2a is an anti-CD8 monoclonal antibody available from 
BDIS.) To the mixture in each aliquot was added 50 µl of 
whole blood. The aliquots were incubated for 30 minutes, 
vortexed and then run on a FACSCount brand flow cytometer. 
Data was acquired and stored in list-mode. A fluorescence 
threshold was set in the PE/Cy5 channel to exclude the 
majority of red blood cells; however, care was taken to 
assure that the threshold was to the left of the far most 
expected edge of the CD4- and CD8- attractors. 

Three elliptical attractors were applied to the bead, 
CD4- and CD4+ or CD8- and CD8+ clusters. One difficulty 
encountered in the analysis of CD8 cells is that, unlike CD4 
cells, CD8 cells do not differentiate into well defined 
positive and negative clusters. A small number of CD8 cells 
will appear to be "dim." These dim cells are CD8+ and 
therefore must be included in the count if the absolute count 
is to be accurate. 

A new clustering tool was developed to solve this 
problem. A "pipe" is drawn connecting the upper (i.e., CD8+) 
cluster with the lower (i.e., CD8-) cluster. It is drawn so 
that in a 2-D plot one side extends from the left most edge 
of the upper cluster boundary to the left most edge of the 
lower cluster boundary and the other side extends from the 
right most edge of the upper cluster boundary to the right 
most edge of the lower cluster boundary. Any events falling 
within the orbital bands surrounding the cluster boundaries 
of the pipe are monitored as a PQA check assuring proper 
containment of CD8 dim cells and as a PQA check against 
encroachment by debris. 
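
As a rough illustration of the pipe geometry, the region between the two clusters in a 2-D plot can be treated as a quadrilateral whose corners are the left most and right most edge points of the upper and lower cluster boundaries. The sketch below assumes that these four corner points are supplied directly and that they form a convex quadrilateral; the function name and corner ordering are hypothetical, and the orbital bands around the pipe are not modeled.

```python
def in_pipe(point, upper_left, upper_right, lower_right, lower_left):
    """True when a 2-D point lies inside the convex quadrilateral joining two clusters."""
    corners = [upper_left, upper_right, lower_right, lower_left]
    crosses = []
    for i in range(4):
        ax, ay = corners[i]
        bx, by = corners[(i + 1) % 4]
        # Sign of the cross product tells which side of edge a->b the point is on.
        crosses.append((bx - ax) * (point[1] - ay) - (by - ay) * (point[0] - ax))
    # Inside a convex polygon the point is on the same side of every edge.
    return all(c >= 0 for c in crosses) or all(c <= 0 for c in crosses)
```

For corners at (2, 8), (4, 8), (5, 2) and (1, 2), the point (3, 5) falls inside the pipe and the point (0, 5) falls outside it.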

In addition to the pipe region tool described above, an 
additional tool was developed to handle the special case 
where fluorescent control and/or reference beads are included 
in the analysis of fluorescently labelled cells. In this 
instance, a circular 2-D bead peak attractor is used to 
pinpoint the vector mean of the beads, which is then used to 
predict, by fixed vector offsets, the most likely positions 
of the cell population clusters. The goal is that the bead 
peak location will reveal drift in the optical power, 
alignment and sensitivity of the instrument. Any drift in 
the bead peak predicts similar drift in the cell clusters; 
therefore, any offset in the location of the bead peak will 
cause the seed locations to be offset by a similar amount in 
a similar direction. This may be accomplished by a two-step 
analysis where only beads are analyzed initially in order to 
establish the bead peak or by means of analysis of a control 
tube prior to actual sample acquisition. In the former case, 
a circular attractor is employed to establish the bead peak 
while an elliptical attractor is employed in the analysis 
step. 
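
The drift correction amounts to adding one offset vector to every seed location. A minimal sketch follows, assuming the expected and observed bead peak positions are known as 2-D vectors and the seeds are kept in a dictionary keyed by population name; these names and data structures are illustrative assumptions only.

```python
def shift_seeds(expected_bead_peak, observed_bead_peak, seeds):
    """Offset every seed location by the drift observed in the bead peak."""
    offset = [o - e for o, e in zip(observed_bead_peak, expected_bead_peak)]
    return {name: [s + d for s, d in zip(seed, offset)]
            for name, seed in seeds.items()}
```

If the bead peak was expected at (500, 500) but is found at (510, 495), a seed at (300, 200) would be moved to (310, 195).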

FIGS. 5(A) and 5(B) display the final positions of the 
clusters and the events that fell within each cluster for 
whole blood from an AIDS patient. In FIG. 5(A), the majority 
of events within a cluster occur within the CD4- or CD4+ 
clusters. There are few events that fall outside the cluster 
that are not either CD4+ or CD4- T cells or beads. In FIG. 
5(B), the events are distributed in a manner similar to CD4 
cells; however, the pipe region is applied to collect those 
CD8+ cells that express "dim" amounts of fluorescence. Table 
I sets forth the numbers of events that fell within each 
cluster as well as those non-red blood cell events that were 
not clustered. 



TABLE I 

CD4 Tube                                CD8 Tube 

Beads    6729          Orb. Beads   4     Beads    17229   Orb. Beads    17 
CD4+      874          Orb. CD4+   56     CD8+      5101   Orb. CD8+     426 
CD4-     (illegible)   Orb. CD4-  223     CD8-      1548   Orb. CD8-     (illegible) 
                                          CD8 dim    586   Orb. CD8 dim  117 

Based upon this data, the number of CD4+ cells per µl of 
whole blood was calculated as 156; the number of CD3+ cells 
per µl of whole blood was calculated as 972 in the CD4 
tube and 978 in the CD8 tube; and the number of CD8+ cells 
per µl of whole blood was calculated as 769. The number of 
cells in the orbital bands was low, confirming the integrity 
of the cluster. 

The data from FIG. 5 is to be compared with the data from 
FIG. 6 to show how this invention provides PQA. For example, 
from FIG. 6(A) it can be seen that the CD4- cluster is 
contaminated with debris and red blood cells, whereas in FIG. 
5(A) there is a separation between the red blood cells/debris 
and the CD4- cells. This problem also shows up in Table II 
where the number of events occurring in the orbital bands for 
CD4- and CD8- is higher than should be expected if cluster 
integrity had been maintained. Based on this data, the 
sample in FIG. 6 should have been rejected. 

TABLE II 

CD4 Tube                          CD8 Tube 

Beads   9468   Orb. Beads    1     Beads    17453   Orb. Beads    13 
CD4+    2501   Orb. CD4+    69     CD8+       713   Orb. CD8+     82 
CD4-    1519   Orb. CD4-   763     CD8-      2501   Orb. CD8-     343 
                                   CD8 dim    189   Orb. CD8 dim  173 

Another aspect of this invention also is shown in Table 
II. For both the CD4 and CD8 tubes, once the number of events 
that were CD4+ exceeded 2500, the counting ceased. The 
instrument had been set with 2500 events in the CD4+ window 
as an auto-shut off. The same is true for the number of CD8- 
events in the CD8 tube. 

All publications and patent applications mentioned in 
this specification are indicative of the level of ordinary 
skill in the art to which this invention pertains. All 
publications and patent applications are herein incorporated 
by reference to the same extent as if each individual 
publication or patent application was specifically and 
individually indicated to be incorporated by reference. 

It will be apparent to one of ordinary skill in the art 
that many changes and modifications can be made in the 
invention without departing from the spirit or scope of the 
appended claims. 

What is claimed is: 

1. A method for autoclustering particles into one or 
more clusters wherein multi-parameter data are collected for 
each particle in a sample of particles comprising the steps 
of: 

a) for each cluster expected in a sample, fixing a 
geometric boundary surface so as to confer membership in the 
cluster that is fixed in shape, size and orientation, but not 
position prior to autoclustering; 

b) setting a seed location and radius for each cluster; 

c) transforming a vector for each particle analyzed into 
a coordinate system, wherein the vector comprises values for 
each parameter collected; 

d) summing each vector to calculate a vector mean if the 
proximity of that vector is less than the radius distance 
from the center location of the vector; 

e) after a pre-determined number of vectors are added to 
the vector sum to calculate the vector mean calculating the 
center location as the vector mean; 

f) repeating steps c)-e) until a pre-determined number 
of vectors have been included in the calculation of the 
vector mean; 

g) establishing a final geometric boundary based upon 
the last center location calculated; and 

h) comparing all subsequent particle vectors against the 
final boundary for inclusion within or exclusion outside the 
boundary. 

2. The method of claim 1 wherein orbital bands are set 
in step b) for one or more of the clusters. 

3. The method of claim 1 wherein the particles comprise 
cells. 



4. A method for autoclustering blood cells in a sample 
of such cells into two or more clusters wherein 
multiparameter data are collected for each cell by means of 
flow cytometry comprising the steps of: 

(a) for each cluster expected in the sample, fixing a 
geometric boundary surface so as to confer membership in the 
cluster that is fixed in shape, size and orientation but not 
position prior to autoclustering; 

(b) setting a seed location, radius and orbital band for 
each expected cluster; 

(c) analyzing the cells by means of flow cytometry 
wherein at least two parameters of data are recorded for each 
cell analyzed; 

(d) transforming a vector for each cell analyzed into a 
coordinate system, wherein the vector comprises values for 
each parameter recorded; 

(e) summing each vector to calculate a vector mean if 
the proximity of that vector is less than the radius distance 
from the center location of the vector; 

(f) after a pre-determined number of vectors are added 
to the vector sum to calculate the vector mean calculating 
the center location as the vector mean; 

(g) repeating steps c)-e) until a pre-determined number 
of vectors have been included in the calculation of the 
vector mean; 

(h) establishing a final geometric boundary based upon 
the last center location calculated; and 

(i) comparing all subsequent particle vectors against 
the final boundary for inclusion within or exclusion outside 
the boundary. 

5. The method of claim 4 wherein the number of clusters 
is two. 

6. The method of claim 4 wherein the parameters 
recorded comprise at least two measurements of fluorescence 
emissions. 

7. The method of claim 4 wherein the cells comprise T 
lymphocytes. 

8. The method of claim 7 wherein the clusters comprise 
at least CD4+ and CD4- cells and CD8+ and CD8- cells. 

9. The method of claim 4 wherein the cells are labeled 
with at least one markers prior to step a) wherein each 
marker has an emission wavelength that is distinguishable 
from the others. 

10. The method of claim 9 wherein the cells in the 
sample have been labeled with at least immunofluorescent 
markers. 

11. The method of claim 4 wherein the clusters are 
selected from the group consisting of lymphocytes, monocytes, 
granulocytes, platelets and red blood cells. 

12. The method of claim 11 wherein any of the clusters 
is divided into sub-clusters. 

13. The method of claim 4 wherein the number of clusters 
is at least three. 

14. The method of claim 13 wherein one or more of the 
clusters comprises a fluorescent bead population. 

15. The method of claim 14 wherein the bead population 
is sampled and analyzed prior to analysis of the cells in 
order to correct for drift. 

16. The method of claim 8 wherein a pipe region is 
applied between the CD8+ and CD8- clusters. 

17. The method of claim 4 wherein orbital bands are set 
for each cluster. 
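
For orientation only, the summing, updating and classifying steps recited in claim 1 might be sketched for a single spherical attractor as shown below. The class name, the parameter names update_every and max_interactions, and the choice of Python are assumptions made for this sketch; it is not offered as the claimed method itself, and it omits orbital bands and the coordinate transformation step.

```python
import math

class SphericalAttractor:
    def __init__(self, seed, radius, update_every, max_interactions):
        self.center = list(seed)                  # seed location
        self.radius = radius
        self.update_every = update_every          # vectors summed between center updates
        self.max_interactions = max_interactions  # vectors summed before the boundary is final
        self.vector_sum = [0.0] * len(seed)
        self.count = 0
        self.frozen = False

    def gate(self, event):
        """Proximity test against the current center location."""
        return math.dist(event, self.center) < self.radius

    def interact(self, event):
        """Pre-analysis: add a gated event to the vector sum and update the center."""
        if self.frozen or not self.gate(event):
            return
        self.vector_sum = [s + x for s, x in zip(self.vector_sum, event)]
        self.count += 1
        if self.count % self.update_every == 0:
            # The center becomes the vector mean of all events gated so far.
            self.center = [s / self.count for s in self.vector_sum]
        if self.count >= self.max_interactions:
            self.frozen = True                    # boundary fixed at last center calculated

    def classify(self, event):
        """Classification: compare an event against the final boundary."""
        return self.gate(event)
```

Events outside the radius never enter the sum, so a distant cluster cannot pull the center toward itself.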



Abstract We consider the problem of discovering the conceptual clusters from 
a large database. From Z. Pawlak's information system in rough set theory, we 
define an information matrix, information mappings and some concepts in data 
mining literature such as large sets, association rules and conceptual cluster. We 
propose a combined method of information matrix, Kohonen's neural network 
for large set discovery and genetic algorithm for conceptual cluster validity. We 
present an application of our method to a student database for discovering the 
rules contributing to the training of the gifted students. 



1 Introduction 

Data Mining (DM) is to discover the interesting patterns present implicitly in large 
databases [7]. In this paper, we study the problem of conceptual cluster discovery from 
a large database. This problem is stated as: given a set of objects, conceptual 
clustering discovery is to find clusters of objects based on a conceptual closeness 
among objects [1],[2],[3],[4]. We propose a method for solving and expanding this 
problem. Based on Z. Pawlak's information system [9], we define an information 
matrix and some concepts; then we employ a combined Kohonen's self-organizing 
algorithm (SOA) and genetic algorithm (GA) for conceptual cluster discovery and for 
building rules from these discovered concepts. We build an information matrix in the 
computer memory for improving the speed of the mining process. The paper is organized 
as follows. Section 1: Introduction. Section 2: Formal definitions. Section 3: Problem 
statement. Section 4: Using SOA for discovering large descriptor sets. Section 5: 
Using GA for cluster validity. Section 6: An application to a student database. Section 
7: Conclusions and future works. 



2 Formal definitions 

In this section, we define an information matrix and some concepts related to our 
proposed method. Based on these definitions, we implement a set of functions for 
processing the mining tasks in the computer memory instead of scanning the whole 
database on disk. Therefore, we can significantly improve the speed of the mining 
process. 



2.1 Definition 1: Information matrix 

Information matrix is defined as $B = (O, D)$ where $O = \{o_1, \dots, o_n\}$ is a finite 
set of n objects and $D = \{d_1, \dots, d_m\}$ is a finite set of m descriptors. Let 
$b_{ij}$ ($i = 1, \dots, n$ and $j = 1, \dots, m$) be the element of matrix B; 
$b_{ij} = 1$ if $o_i$ has $d_j$, otherwise $b_{ij} = 0$. 



2.2 Definition 2: Information mappings 

Given a finite set O of n objects and a finite set D of m descriptors [5]. Let $P(D)$ be 
the power set of D and $P(O)$ be the power set of O. Information mapping $\chi$ is defined 
as: $\chi : O \times D \to \{0, 1\}$. Given $o \in O$ and $d \in D$, $\chi(o, d) = 1$ if o 
has d, otherwise $\chi(o, d) = 0$. 
Mappings $\rho$ and $\lambda$ are defined as: $\rho : P(D) \to P(O)$ and 
$\lambda : P(O) \to P(D)$ where: 
Given $S \subseteq D$ then $\rho(S) = \{o \in O : \forall d \in S,\ \chi(o, d) = 1\}$ 
Given $X \subseteq O$ then $\lambda(X) = \{d \in D : \forall o \in X,\ \chi(o, d) = 1\}$ 
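
As an illustration, the two mappings can be computed directly on an in-memory 0/1 matrix. The sketch below is not the authors' implementation: the function names rho and lam, the use of row and column indices in place of objects and descriptors, and the small example matrix are all assumptions.

```python
def rho(B, S):
    """Objects (row indices) that have every descriptor (column index) in S."""
    return {i for i, row in enumerate(B) if all(row[j] == 1 for j in S)}

def lam(B, X):
    """Descriptors (column indices) shared by every object (row index) in X."""
    return {j for j in range(len(B[0])) if all(B[i][j] == 1 for i in X)}

# Hypothetical information matrix: 3 objects by 3 descriptors.
B = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
]
```

Here rho(B, {0, 1}) gives the objects {0, 1}, and lam(B, {0, 1}) gives back the descriptors {0, 1}.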



2.3 Definition 3: Large descriptor set 

Given an information matrix $B = (O, D)$ and a threshold $\tau$ which is the MINSUP of 
the large item set in data mining literature [7]. A large descriptor set S is a subset 
of D that satisfies the condition $Card(\rho(S)) / Card(O) \ge \tau$, where Card is the 
cardinality of a set. 



2.4 Definition 4: Binary association rule 

Given an information matrix $B = (O, D)$ and a threshold $\tau$. Let S be a large 
descriptor set of B. Let $L_i$, $L_j$ be subsets of S. A binary association rule with 
threshold $\tau$ is a mapping from $L_i$ to $L_j$ and is denoted as $L_i \to L_j$. 



2.5 Definition 5: Confidence factor of a binary association rule 

Let S be a large descriptor set of B, $L_i$, $L_j$ be subsets of S, and $L_i \to L_j$ be 
a binary association rule with a threshold $\tau$. The confidence factor 
$CF(L_i \to L_j)$ of this rule is calculated by 
$Card(\rho(L_i) \cap \rho(L_j)) / Card(\rho(L_i))$. 
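
The support test of Definition 3 and the confidence factor of Definition 5 can be sketched on an in-memory 0/1 matrix as follows. The function names, the index-based representation, and the example matrix are assumptions for illustration; returning 0.0 for a rule whose antecedent matches no object is also a convention chosen here, not one stated in the text.

```python
def rho(B, S):
    """Objects (row indices) that have every descriptor (column index) in S."""
    return {i for i, row in enumerate(B) if all(row[j] == 1 for j in S)}

def is_large(B, S, tau):
    """Definition 3: the fraction of objects having all of S is at least tau."""
    return len(rho(B, S)) / len(B) >= tau

def confidence(B, Li, Lj):
    """Definition 5: Card(rho(Li) & rho(Lj)) / Card(rho(Li))."""
    base = rho(B, Li)
    if not base:
        return 0.0          # assumed convention for an empty antecedent match
    return len(base & rho(B, Lj)) / len(base)

# Hypothetical information matrix: 4 objects by 3 descriptors.
B = [
    [1, 1, 0],
    [1, 1, 1],
    [1, 0, 1],
    [0, 1, 1],
]
```

With this matrix and a threshold of 0.5, the descriptor set {0, 1} is large (two of four objects), while {0, 1, 2} is not (one of four). The rule {0} -> {1} has a confidence factor of 2/3.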




2.6 Definition 6: Concept 

A concept is a pair $C = (X, S)$ where $X \subseteq O$ and $S \subseteq D$. X and S 
satisfy the following conditions: 

a) $X \subseteq \rho(S)$ and $\lambda(X) = S$ 

b) $\forall L_i, L_j \subseteq S$ with $Card(L_i) = Card(L_j) - 1$, 
$\rho(L_i) \subseteq \rho(L_j)$. 



3 Problem statement 

Problem 1 Given an information matrix B and a threshold $\tau$, find all large descriptor 
sets of B. The large descriptor set determines the popular descriptors of data objects. 
The threshold $\tau$ determines a measure of popularity [7]. 

Problem 2 Given an information matrix B and a threshold $\tau$, find k conceptual 
clusters $C_1, \dots, C_k$ where $C_i = (X_i, S_i)$. These conceptual clusters satisfy: 
a) $X_i \cap X_j = \emptyset$ for $i, j = 1, \dots, k$ with $i \neq j$; 
b) (condition illegible in the source); 
c) $Card(X_i) / Card(O) \ge \tau$; 
d) Maximize the ratio $Card(X_1 \cup \dots \cup X_k) / Card(O)$; 
e) $C_i$ is a concept. 
A conceptual cluster determines an object set that has the same set of descriptors. Based 
on the concept $C = (X, S)$, we build rule $L_i \to L_j$ where $L_i \cup L_j = S$ and 
$L_i \cap L_j = \emptyset$. It means that if an object has all the descriptors of $L_i$ 
(rule antecedent) then the object has all the descriptors of $L_j$ (rule consequent). 



4 Using SOA for discovering large descriptor sets 

In this section, we employ SOA for discovering the potential large descriptor sets [6]. 
SOA can be summarized as follows: 

Step 1. Initialize all weight vectors of Kohonen's neural network. 
Step 2. Select the node with minimum distance $d_v$ to the input vector $v(t)$. 
Step 3. Update weight vectors of nodes that lie within a nearest neighbor set 
of the node $(i_c, j_c)$: $w_{ij}(t+1) = w_{ij}(t) + \alpha(t)\,(v(t) - w_{ij}(t))$ 
for $i_c - N_c(t) \le i \le i_c + N_c(t)$ and $j_c - N_c(t) \le j \le j_c + N_c(t)$. 
Step 4. Update time $t = t + 1$, add new input vector and go to Step 2. 

In the above algorithm, $d_v$ is Euclidean distance, $\alpha(t)$ is a gain ratio 
($0 \le \alpha(t) \le 1$) and $N_c(t)$ is the radius of the neighbor set. $N_c(t)$ and 
$\alpha(t)$ are decreased monotonically with time. The algorithm finishes when 
$\alpha(t) = 0$ or $N_c(t) = 0$. 
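
The four steps can be sketched in plain Python as shown below. This is a simplified illustration under stated assumptions rather than the authors' implementation: the function name train_som, the linear decay schedules for the gain and the neighborhood radius, the initial gain of 0.5, and the cyclic presentation of the input rows are all choices made here. The loop ends at the step where both the gain and the radius would reach zero.

```python
import math
import random

def train_som(inputs, rows, cols, steps, alpha0=0.5, seed=0):
    rng = random.Random(seed)
    dim = len(inputs[0])
    radius0 = max(1, max(rows, cols) // 2)
    # Step 1: random initial weight vector for every node of the output grid.
    w = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
         for _ in range(rows)]
    for t in range(steps):
        frac = 1.0 - t / steps                  # decays linearly from 1 toward 0
        alpha = alpha0 * frac                   # gain alpha(t)
        radius = math.ceil(radius0 * frac)      # neighborhood radius N_c(t)
        v = inputs[t % len(inputs)]             # Step 4: next input vector
        # Step 2: winning node (i_c, j_c) has the smallest distance to v.
        ic, jc = min(
            ((i, j) for i in range(rows) for j in range(cols)),
            key=lambda ij: sum((a - b) ** 2
                               for a, b in zip(v, w[ij[0]][ij[1]])),
        )
        # Step 3: pull every node in the square neighborhood toward v.
        for i in range(max(0, ic - radius), min(rows, ic + radius + 1)):
            for j in range(max(0, jc - radius), min(cols, jc + radius + 1)):
                w[i][j] = [wk + alpha * (vk - wk)
                           for wk, vk in zip(w[i][j], v)]
    return w
```

Each row of the information matrix would be passed as one input vector, for example train_som(B, 3, 3, 200) for a small matrix B. The winner is selected on squared distance, which gives the same node as the Euclidean distance.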

Given an information matrix in Table 1, each row of this matrix corresponds to an 
input vector of Kohonen's neural network. 



Table 1. An information matrix for large descriptor set discovery. 

After running SOA, we have the potential large descriptor sets: 
$\{d_1, d_2, d_3\}$, $\{d_4, d_5, d_6\}$, $\{d_1, d_2, d_3, d_4\}$. With $\tau = 50\%$, 
$\{d_1, d_2, d_3\}$ and $\{d_4, d_5, d_6\}$ are large descriptor sets; 
$\{d_1, d_2, d_3, d_4\}$ is not a large descriptor set because 
$Card(\rho(\{d_1, d_2, d_3, d_4\})) / Card(O) = 33.3\% < \tau$. 



5 Using GA for cluster validity 

Large descriptor sets discovered by SOA are used for building the initial GA 
population. We hold that the subset of a large descriptor set is also a large descriptor 
set [7]. Let $L = \{L_1, \dots, L_k\}$ be a set of k large descriptor sets; we employ 
GA [8] for finding a set $\{S_1, \dots, S_k\}$ where $S_i \subseteq L_i$ 
$(i = 1, \dots, k)$ and $(\rho(S_i), S_i)$ is a concept. A chromosome is a set of 
$BS_i$; each $BS_i$ is a bit string corresponding to a large descriptor set. With two 
large descriptor sets $\{d_1, d_2, d_3\}$ and $\{d_4, d_6\}$, we have chromosome 
$\{d_1{:}1, d_2{:}1, d_3{:}1, d_4{:}1, d_5{:}0, d_6{:}1\}$. The genetic representation of 
population P is a set of chromosomes. A typical population P with 3 chromosomes is as 
follows: $P(t) = \{111111, 100111, 001100\}$. The genetic operations are defined as: 



5.1 Crossover operator 

Given two parental chromosomes: $\{a_1, a_2, a_3, a_4, a_5, a_6\}$ and 
$\{b_1, b_2, b_3, b_4, b_5, b_6\}$ where $a_i, b_i \in \{0, 1\}$ $(i = 1, \dots, 6)$. The 
crossover will swap a portion of two parental chromosomes and yield the offspring: 
$\{a_1, a_2, a_3, b_4, b_5, b_6\}$ and $\{b_1, b_2, b_3, a_4, a_5, a_6\}$. 



5.2 Mutation operator 

Given a chromosome $\{a_1, a_2, a_3, a_4, a_5, a_6\}$, select a random position 
$h \in [1..6]$. Let h be the selected position; if $a_h = 1$ then $a_h$ is changed to 0 
and vice versa. 
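
Both operators are short when chromosomes are held as bit strings. The sketch below assumes a string representation such as "111111" and a crossover point passed in by the caller; the function names and the optional random generator argument are illustrative choices, not part of the method as described.

```python
import random

def crossover(a, b, point):
    """Swap the tails of two equal-length bit strings after the given position."""
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(chrom, rng=random):
    """Flip the bit at one randomly selected position of the chromosome."""
    h = rng.randrange(len(chrom))
    flipped = "0" if chrom[h] == "1" else "1"
    return chrom[:h] + flipped + chrom[h + 1:]
```

Crossing "111111" with "001100" at position 3 yields "111100" and "001111", which corresponds to the offspring pattern given in Section 5.1.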



5.3 Fitness factor and fitness value 

Fitness factor of $S_{ij}$: Let $S_{ij}$ be a subset of chromosome $BS_i$; we build set 
Q containing all two-element subsets of $S_{ij}$. Let $\{a, b\}$ be an element of Q. 
From $\{a, b\}$, we build two rules $\{a\} \to \{b\}$ and $\{b\} \to \{a\}$ and calculate 
the CFs of these rules. The fitness factor of $S_{ij}$ is the average of the CFs of the 
$2 \times Card(Q)$ rules which are built up from Q. Fitness value of a chromosome 
$BS_i$ is the average of the fitness factors of all $S_{ij}$ in chromosome $BS_i$. 
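
A sketch of the fitness factor computation is given below, assuming an in-memory 0/1 information matrix indexed by row and column. The helper names, the example matrix, and the convention of returning 0.0 when a descriptor set has fewer than two elements (so that Q is empty) are assumptions made for this illustration.

```python
from itertools import combinations

def rho(B, S):
    """Objects (row indices) that have every descriptor (column index) in S."""
    return {i for i, row in enumerate(B) if all(row[j] == 1 for j in S)}

def confidence(B, Li, Lj):
    """Confidence factor Card(rho(Li) & rho(Lj)) / Card(rho(Li))."""
    base = rho(B, Li)
    return len(base & rho(B, Lj)) / len(base) if base else 0.0

def fitness_factor(B, S):
    """Average CF over both single-descriptor rules of every two-element subset of S."""
    cfs = []
    for a, b in combinations(sorted(S), 2):     # the set Q of two-element subsets
        cfs.append(confidence(B, {a}, {b}))
        cfs.append(confidence(B, {b}, {a}))
    return sum(cfs) / len(cfs) if cfs else 0.0  # assumed value when Q is empty

# Hypothetical information matrix: 4 objects by 3 descriptors.
B = [
    [1, 1, 0],
    [1, 1, 1],
    [1, 0, 1],
    [0, 1, 1],
]
```

For the descriptor set {0, 1} in this matrix, both rules have a confidence factor of 2/3, so the fitness factor is also 2/3.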



6 An application to a student database 

We employ our proposed method for discovering the conceptual clusters from a 
student database. An information matrix with 1000 rows and 100 columns is built up 
from this database. In this matrix, each row corresponds to a record and each column 
corresponds to a descriptor. Some descriptors of the information matrix are "parents of 
student are teachers"; "student is ranked in good level of learning"; "student wins a 
prize of computer science competition". 

The size of Kohonen's output layer is 100x100. With the threshold $\tau = 0.7$ (70%), we 
discover some large descriptor sets such as {student wins a prize of a math competition; 
student is interested in math}; {student is ranked in good level of learning; parents of 
student are teachers}; {student is interested in math; student is interested in foreign 
language; student is interested in computer science}. 

We employ the following values for GA parameters: number of chromosomes is 50; 
number of generations is 300; crossover probability is 0.1; mutation probability is 0.1. 
The GA gives us some discovered conceptual clusters such as {student is ranked in good 
level of learning; student has good behavior; parents of student are teachers; student 
has the self-learning time greater than 6 hours every day}; {student is interested in 
math; student is interested in foreign language; student is interested in computer 
science}; {student lives in country; income of student family is lower than $100 every 
month; student is ranked in fair level of learning}. 



7 Conclusions and future works 

We gathered some preliminary results in using a combined information matrix, GA 
and SOA for cluster discovery in data mining. The experimental results are very 
encouraging on a large data set. A matrix expressed in bits is also used for keeping 
the whole information matrix in main memory to increase the efficiency of conceptual 
cluster discovery. We continue to study how to change the binary information matrix to 
a fuzzy information matrix and use fuzzy cluster discovery for fuzzy databases. 